NetProphet 3: a machine learning framework for transcription factor network mapping and multi-omics integration

Abstract
Motivation: Many methods have been proposed for mapping the targets of transcription factors (TFs) from gene expression data. It is known that combining outputs from multiple methods can improve performance. To date, outputs have been combined by using either simplistic formulae, such as geometric mean, or carefully hand-tuned formulae that may not generalize well to new inputs. Finally, the evaluation of accuracy has been challenging due to the lack of genome-scale, ground-truth networks.
Results: We developed NetProphet3, which combines scores from multiple analyses automatically, using a tree boosting algorithm trained on TF binding location data. We also developed three independent, genome-scale evaluation metrics. By these metrics, NetProphet3 is more accurate than other commonly used packages, including NetProphet 2.0, when gene expression data from direct TF perturbations are available. Furthermore, its integration mode can forge a consensus network from gene expression data and TF binding location data.
Availability and implementation: All data and code are available at https://zenodo.org/record/7504131#.Y7Wu3i-B2x8.
Supplementary information: Supplementary data are available at Bioinformatics online.


Fig. S9 (A-B) Performance of NP3-10CV using either the TFKO or the ZEV dataset. (A) Area under the receiver operating characteristic curve (AUROC) and (B) area under the precision-recall curve (AUPRC).
The AUROC and AUPRC are calculated using binding data. NP3-10CV performs better with the combined TFKO and ZEV datasets than with either dataset alone.
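
As an illustration of the two metrics named in the caption, the following minimal Python sketch scores a ranked list of TF-target edges against binary binding labels. It is not the code used to produce the figure: the function names, the toy scores, and the use of average precision as the AUPRC estimate are all assumptions made for this example.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one bound and one unbound edge")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of edges that share the same score.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def average_precision(scores, labels):
    """Step-wise AUPRC estimate: precision weighted by the recall gained at each score."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one bound edge")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = 0
    ap = 0.0
    i = 0
    while i < len(order):
        j = i
        group_pos = 0
        # Edges with equal scores enter the ranking together.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            group_pos += labels[order[j]]
            j += 1
        tp += group_pos
        ap += (group_pos / n_pos) * (tp / j)
        i = j
    return ap


# Toy data (hypothetical): predicted edge scores and 1/0 binding labels.
edge_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_bound = [1, 0, 1, 1, 0, 0]
print(auroc(edge_scores, is_bound), average_precision(edge_scores, is_bound))
```

Both functions treat tied scores as a single threshold, so the result does not depend on the arbitrary order of tied edges.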

NetProphet3
In the experiments reported here, we used NP3 v1.0. The latest version of NetProphet3 is available on GitHub at https://github.com/BrentLab/NetProphet_3.0. The training labels and input data are provided as supplemental files S1-4. NP3 combines four weighted networks: LASSO, DE, BART, and PWM. Each is inferred by a different intermediate software package. For LASSO, we used 10-fold cross-validation (CV) to select lambda, with the option of using a single lambda for all targets. We used the R packages Lars v1.2 for LASSO, BayesTrees v0.3.1.4 for BART, and FIRE v1.1 for PWM, as described in the NetProphet2.0 paper (Kang, et al., 2017). We used the weighted network output by BART as the input to FIRE. Learned PWMs were scanned over promoters using FIMO from the MEME suite. For training and hyperparameter tuning, NP3 used the packages XGBoost v1.3.2.1 (Chen and Guestrin, 2016) and mlr v2.19.0 (Bischl, et al., 2016).
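
To make the combination step concrete, the sketch below shows one plausible way to arrange the four weighted networks as a per-edge feature table with binding-derived labels, which is the kind of input a tree boosting learner expects. It is only an illustration in Python, not the NP3 implementation (which uses R, XGBoost, and mlr). The function name `build_edge_table`, the toy scores, and the choice to fill a missing score with 0.0 are assumptions.

```python
def build_edge_table(networks, bound_edges,
                     feature_order=("LASSO", "DE", "BART", "PWM")):
    """Return one (edge, features, label) row per TF-target edge.

    networks: dict mapping a network name to {(tf, target): score}.
    bound_edges: set of (tf, target) pairs supported by binding data.
    A missing score is filled with 0.0 (an assumption of this sketch).
    """
    edges = sorted(set().union(*(networks[name].keys() for name in feature_order)))
    rows = []
    for edge in edges:
        features = [networks[name].get(edge, 0.0) for name in feature_order]
        label = 1 if edge in bound_edges else 0
        rows.append((edge, features, label))
    return rows


# Toy inputs (hypothetical scores, not from the supplemental files).
toy_networks = {
    "LASSO": {("TF1", "g1"): 0.4, ("TF1", "g2"): -0.1},
    "DE": {("TF1", "g1"): 2.0},
    "BART": {("TF1", "g2"): 0.3, ("TF2", "g1"): 0.7},
    "PWM": {("TF2", "g1"): 5.5},
}
toy_bound = {("TF1", "g1")}

for edge, features, label in build_edge_table(toy_networks, toy_bound):
    print(edge, features, label)
```

Each row holds the four scores of one edge in a fixed column order, so the feature lists and labels can be passed to a boosted-tree library of the reader's choice.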

Genie3
We used Genie3, available on GitHub at https://github.com/vahuynh/GENIE3, and ran the R implementation with default parameters.

Spearman co-expression network
We calculated the Spearman correlation between the expression profile of each target gene (the first vector) and that of each regulator (TF; the second vector). Each Spearman correlation value serves as the score of the edge between that TF and that target gene.
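
The calculation described above can be sketched in plain Python as follows. This is a minimal illustration rather than the code used in the study: the function names, the toy expression values, and the decision to skip the edge from a TF to itself are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_network(expression, tfs):
    """Score every (tf, target) edge; self-edges are skipped (an assumption)."""
    scores = {}
    for tf in tfs:
        for target, profile in expression.items():
            if target != tf:
                scores[(tf, target)] = spearman(profile, expression[tf])
    return scores


# Toy expression profiles (hypothetical): gene -> values across four samples.
toy_expression = {
    "TF1": [1.0, 2.0, 3.0, 4.0],
    "g1": [10.0, 20.0, 30.0, 40.0],
    "g2": [4.0, 3.0, 2.0, 1.0],
    "g3": [1.0, 3.0, 2.0, 4.0],
}
toy_scores = spearman_network(toy_expression, ["TF1"])
print(toy_scores)
```

The sign of each score is kept, so a negative value indicates that the target's expression tends to fall as the TF's expression rises.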

Gene-expression profiles
We used two gene-expression datasets for Saccharomyces cerevisiae in which each TF was perturbed either by deletion (TFKO) (Kemmeren, et al., 2014) or overexpression (ZEV) (Hackett, et al., 2020). We downloaded the TFKO dataset from http://deleteome.holstegelab.nl/data/downloads/deleteome_all_mutants_controls.txt; it includes 1485 gene expression profiles of strains in which a single gene was deleted from the genome. In 281 of these strains a TF was deleted. We used the TF perturbation profiles to construct the DE features by replacing fold changes between -1.3 and 1.3 by 0. Values from the original dataset were negated so that positive values correspond to an activating edge and negative values correspond to a repressive edge. We used all 1485 gene expression profiles, unmodified, as input to LASSO and BART. For the ZEV dataset, we used the column labeled log2_cleaned_ratio of the file downloaded from https://storage.googleapis.com/calico-website-pin-publicbucket/datasets/pin_tall_expression_data.zip. This dataset contains gene expression profiles measured at various time points after transient induction of 167 TFs using estradiol. To construct DE features, we used shrunken log2 fold change data from 15 minutes after induction and replaced fold changes between -1.3 and 1.3 by 0. As input to LASSO, DE, and other regression algorithms, we used 591 gene-expression profiles from time points 15, 45, or 90 minutes after induction. All inputs are provided as supplemental files S1-3.
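
The DE feature construction described above amounts to a simple thresholding step, sketched here in Python for illustration. The function name `de_feature` and the toy values are hypothetical. The text does not say how values exactly equal to -1.3 or 1.3 are handled, so keeping them is an assumption of this sketch.

```python
def de_feature(fold_changes, threshold=1.3, negate=False):
    """Zero out fold changes strictly between -threshold and threshold.

    negate=True flips the sign first, as described for the TFKO data, so that
    positive values correspond to activation and negative values to repression.
    Values exactly at +/- threshold are kept (an assumption of this sketch).
    """
    features = []
    for value in fold_changes:
        if negate:
            value = -value
        features.append(0.0 if -threshold < value < threshold else value)
    return features


# Toy fold changes (hypothetical) for one perturbed TF across five genes.
toy_fold_changes = [2.0, -0.5, 1.3, -1.29, -3.0]
print(de_feature(toy_fold_changes))               # ZEV-style: threshold only
print(de_feature(toy_fold_changes, negate=True))  # TFKO-style: negate, then threshold
```

Because the threshold is symmetric around zero, negating before or after thresholding gives the same result.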

Evaluation metrics
Code for the evaluation metrics (binding, GO, GO-directness, and PPI) is available in a single package on GitHub at https://github.com/BrentLab/NET-evaluation. The GO evaluation metric was applied as follows. For each TF's targets, we performed GO enrichment analysis using GO-Term-Finder v0.86 (Boyle et al. 2004). We downloaded GO biological process terms from the Gene Ontology website http://geneontology.org/ and used the downloaded files to replace the annotations that came with GO-Term-Finder v0.86. We provide these files as supplemental files S6-7.

7. Overfitting a TF-specific model with a randomized feature set